The case study aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.
In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import itertools
app=pd.read_csv("application_data.csv")
prev=pd.read_csv("previous_application.csv")
col=pd.read_csv("columns_description.csv",encoding='ISO-8859-1')
# The file I'm trying to read might not actually be encoded in utf-8. It could be using another encoding like ISO-8859-1
app.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
It contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.
prev.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
It contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.
col.head()
| Unnamed: 0 | Table | Row | Description | Special | |
|---|---|---|---|---|---|
| 0 | 1 | application_data | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_data | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | 5 | application_data | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_data | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_data | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
It is data dictionary which describes the meaning of the variables.
app.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307509.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 159080.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 2.152665 | 2.052463 | 2.031521 | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | 0.114231 | 0.087543 | 0.977065 | 0.759637 | 0.042553 | 0.074490 | 0.145193 | 0.222315 | 0.228058 | 0.064958 | 0.105645 | 0.105975 | 0.008076 | 0.027022 | 0.117850 | 0.087955 | 0.977752 | 0.755746 | 0.044595 | 0.078078 | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | 0.102547 | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.910682 | 0.509034 | 0.502737 | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | 0.161380 | 0.081184 | 0.092576 | 0.110565 | 0.047732 | 0.069523 | 0.107936 | 0.084307 | 0.064575 | 0.110111 | 0.074445 | 0.132256 | 0.100977 | 0.143709 | 0.161160 | 0.081750 | 0.097880 | 0.111845 | 0.046276 | 0.070254 | 0.109076 | 0.082179 | 0.059897 | 0.112066 | 0.076144 | 0.134467 | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | 0.107462 | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.050400 | 0.045300 | 0.000000 | 0.000000 | 0.052500 | 0.040700 | 0.976700 | 0.699400 | 0.007200 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016600 | 0.054200 | 0.042700 | 0.000000 | 0.000000 | 0.058300 | 0.043700 | 0.976700 | 0.691400 | 0.007900 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | 0.041200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048100 | 0.075600 | 0.074500 | 0.000000 | 0.003600 | 0.084000 | 0.074600 | 0.981600 | 0.764800 | 0.019000 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.045800 | 0.077100 | 0.073100 | 0.000000 | 0.001100 | 0.086400 | 0.075800 | 0.981600 | 0.758500 | 0.020800 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | 0.068800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.121000 | 0.129900 | 0.003900 | 0.027700 | 0.143900 | 0.112400 | 0.986600 | 0.823600 | 0.049000 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.084100 | 0.131300 | 0.125200 | 0.003900 | 0.023100 | 0.148900 | 0.111600 | 0.986600 | 0.825600 | 0.051300 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | 0.127600 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 20.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
prev.describe()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | 5951.000000 | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | 0.188357 | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | 0.087671 | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115834 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | 0.034781 | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | 0.160716 | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | 0.189122 | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | 0.193330 | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | 1.000000 | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
col.describe()
| Unnamed: 0 | |
|---|---|
| count | 160.00000 |
| mean | 94.58750 |
| std | 64.12727 |
| min | 1.00000 |
| 25% | 42.75000 |
| 50% | 82.50000 |
| 75% | 122.25000 |
| max | 213.00000 |
app.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
prev.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
col.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 160 entries, 0 to 159 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 160 non-null int64 1 Table 160 non-null object 2 Row 160 non-null object 3 Description 160 non-null object 4 Special 69 non-null object dtypes: int64(1), object(4) memory usage: 6.4+ KB
app.shape
(307511, 122)
prev.shape
(1670214, 37)
col.shape
(160, 5)
col.isnull().sum()
Unnamed: 0 0 Table 0 Row 0 Description 0 Special 91 dtype: int64
prev.isnull().sum()
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 372235 AMT_APPLICATION 0 AMT_CREDIT 1 AMT_DOWN_PAYMENT 895844 AMT_GOODS_PRICE 385515 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 RATE_DOWN_PAYMENT 895844 RATE_INTEREST_PRIMARY 1664263 RATE_INTEREST_PRIVILEGED 1664263 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 0 DAYS_DECISION 0 NAME_PAYMENT_TYPE 0 CODE_REJECT_REASON 0 NAME_TYPE_SUITE 820405 NAME_CLIENT_TYPE 0 NAME_GOODS_CATEGORY 0 NAME_PORTFOLIO 0 NAME_PRODUCT_TYPE 0 CHANNEL_TYPE 0 SELLERPLACE_AREA 0 NAME_SELLER_INDUSTRY 0 CNT_PAYMENT 372230 NAME_YIELD_GROUP 0 PRODUCT_COMBINATION 346 DAYS_FIRST_DRAWING 673065 DAYS_FIRST_DUE 673065 DAYS_LAST_DUE_1ST_VERSION 673065 DAYS_LAST_DUE 673065 DAYS_TERMINATION 673065 NFLAG_INSURED_ON_APPROVAL 673065 dtype: int64
# Calculate percentage of missing values
miss_previous_application = pd.DataFrame({
'column': prev.columns,
'missing_percentage': (prev.isnull().sum() * 100) / prev.shape[0]
}).reset_index(drop=True)
# Add a type column
miss_previous_application["type"] = "previous_application"
# Plot
fig = plt.figure(figsize=(18, 6))
ax = sns.pointplot(x="column", y="missing_percentage", data=miss_previous_application, hue="type")
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing Values in previous_application")
plt.ylabel("Percentage")
plt.xlabel("Columns")
ax.set_facecolor("k")
fig.set_facecolor("lightgrey")
plt.show()
round(100*(prev.isnull().sum()/len(prev.index)),2)
SK_ID_PREV 0.00 SK_ID_CURR 0.00 NAME_CONTRACT_TYPE 0.00 AMT_ANNUITY 22.29 AMT_APPLICATION 0.00 AMT_CREDIT 0.00 AMT_DOWN_PAYMENT 53.64 AMT_GOODS_PRICE 23.08 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 FLAG_LAST_APPL_PER_CONTRACT 0.00 NFLAG_LAST_APPL_IN_DAY 0.00 RATE_DOWN_PAYMENT 53.64 RATE_INTEREST_PRIMARY 99.64 RATE_INTEREST_PRIVILEGED 99.64 NAME_CASH_LOAN_PURPOSE 0.00 NAME_CONTRACT_STATUS 0.00 DAYS_DECISION 0.00 NAME_PAYMENT_TYPE 0.00 CODE_REJECT_REASON 0.00 NAME_TYPE_SUITE 49.12 NAME_CLIENT_TYPE 0.00 NAME_GOODS_CATEGORY 0.00 NAME_PORTFOLIO 0.00 NAME_PRODUCT_TYPE 0.00 CHANNEL_TYPE 0.00 SELLERPLACE_AREA 0.00 NAME_SELLER_INDUSTRY 0.00 CNT_PAYMENT 22.29 NAME_YIELD_GROUP 0.00 PRODUCT_COMBINATION 0.02 DAYS_FIRST_DRAWING 40.30 DAYS_FIRST_DUE 40.30 DAYS_LAST_DUE_1ST_VERSION 40.30 DAYS_LAST_DUE 40.30 DAYS_TERMINATION 40.30 NFLAG_INSURED_ON_APPROVAL 40.30 dtype: float64
previous_application=prev.drop([ 'AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
"RATE_INTEREST_PRIVILEGED"],axis=1)
# Calculate the percentage of missing values for each column
miss_previous_application = pd.DataFrame({
'column': previous_application.columns,
'missing_percentage': (previous_application.isnull().sum() * 100) / previous_application.shape[0]
}).reset_index(drop=True)
# Add a "type" column for plotting
miss_previous_application["type"] = "previous_application"
# Create the plot
fig = plt.figure(figsize=(18, 6))
ax = sns.pointplot(x="column", y="missing_percentage", data=miss_previous_application, hue="type")
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing Values in previous_application")
plt.ylabel("Percentage")
plt.xlabel("Columns")
ax.set_facecolor("k")
fig.set_facecolor("lightgrey")
plt.show()
round(100*(previous_application.isnull().sum()/len(previous_application.index)),2)
SK_ID_PREV 0.00 SK_ID_CURR 0.00 NAME_CONTRACT_TYPE 0.00 AMT_ANNUITY 22.29 AMT_APPLICATION 0.00 AMT_CREDIT 0.00 AMT_GOODS_PRICE 23.08 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 FLAG_LAST_APPL_PER_CONTRACT 0.00 NFLAG_LAST_APPL_IN_DAY 0.00 NAME_CASH_LOAN_PURPOSE 0.00 NAME_CONTRACT_STATUS 0.00 DAYS_DECISION 0.00 NAME_PAYMENT_TYPE 0.00 CODE_REJECT_REASON 0.00 NAME_TYPE_SUITE 49.12 NAME_CLIENT_TYPE 0.00 NAME_GOODS_CATEGORY 0.00 NAME_PORTFOLIO 0.00 NAME_PRODUCT_TYPE 0.00 CHANNEL_TYPE 0.00 SELLERPLACE_AREA 0.00 NAME_SELLER_INDUSTRY 0.00 CNT_PAYMENT 22.29 NAME_YIELD_GROUP 0.00 PRODUCT_COMBINATION 0.02 DAYS_FIRST_DRAWING 40.30 DAYS_FIRST_DUE 40.30 DAYS_LAST_DUE_1ST_VERSION 40.30 DAYS_LAST_DUE 40.30 DAYS_TERMINATION 40.30 NFLAG_INSURED_ON_APPROVAL 40.30 dtype: float64
print("AMT_ANNUITY NULL COUNT:" ,previous_application['AMT_ANNUITY'].isnull().sum())
AMT_ANNUITY NULL COUNT: 372235
previous_application['AMT_ANNUITY'].describe()
count 1.297979e+06 mean 1.595512e+04 std 1.478214e+04 min 0.000000e+00 25% 6.321780e+03 50% 1.125000e+04 75% 2.065842e+04 max 4.180581e+05 Name: AMT_ANNUITY, dtype: float64
sns.set_style('whitegrid')
sns.distplot(previous_application['AMT_ANNUITY'])
plt.show()
print("AMT_GOODS_PRICE NULL COUNT:" ,previous_application['AMT_GOODS_PRICE'].isnull().sum())
AMT_GOODS_PRICE NULL COUNT: 385515
previous_application['AMT_GOODS_PRICE'].describe()
count 1.284699e+06 mean 2.278473e+05 std 3.153966e+05 min 0.000000e+00 25% 5.084100e+04 50% 1.123200e+05 75% 2.340000e+05 max 6.905160e+06 Name: AMT_GOODS_PRICE, dtype: float64
sns.set_style('whitegrid')
sns.distplot(previous_application['AMT_GOODS_PRICE'])
plt.show()
print("NAME_TYPE_SUITE NULL COUNT:" ,previous_application['NAME_TYPE_SUITE'].isnull().sum())
NAME_TYPE_SUITE NULL COUNT: 820405
previous_application['NAME_TYPE_SUITE'].value_counts()
NAME_TYPE_SUITE Unaccompanied 508970 Family 213263 Spouse, partner 67069 Children 31566 Other_B 17624 Other_A 9077 Group of people 2240 Name: count, dtype: int64
print("CNT_PAYMENT NULL COUNT:" ,previous_application['CNT_PAYMENT'].isnull().sum())
CNT_PAYMENT NULL COUNT: 372230
previous_application['CNT_PAYMENT'].describe()
count 1.297984e+06 mean 1.605408e+01 std 1.456729e+01 min 0.000000e+00 25% 6.000000e+00 50% 1.200000e+01 75% 2.400000e+01 max 8.400000e+01 Name: CNT_PAYMENT, dtype: float64
sns.set_style('whitegrid')
sns.boxplot(previous_application['CNT_PAYMENT'])
plt.show()
previous_application.isnull().sum()
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 372235 AMT_APPLICATION 0 AMT_CREDIT 1 AMT_GOODS_PRICE 385515 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 0 DAYS_DECISION 0 NAME_PAYMENT_TYPE 0 CODE_REJECT_REASON 0 NAME_TYPE_SUITE 820405 NAME_CLIENT_TYPE 0 NAME_GOODS_CATEGORY 0 NAME_PORTFOLIO 0 NAME_PRODUCT_TYPE 0 CHANNEL_TYPE 0 SELLERPLACE_AREA 0 NAME_SELLER_INDUSTRY 0 CNT_PAYMENT 372230 NAME_YIELD_GROUP 0 PRODUCT_COMBINATION 346 DAYS_FIRST_DRAWING 673065 DAYS_FIRST_DUE 673065 DAYS_LAST_DUE_1ST_VERSION 673065 DAYS_LAST_DUE 673065 DAYS_TERMINATION 673065 NFLAG_INSURED_ON_APPROVAL 673065 dtype: int64
print("AMT_CREDIT :" ,previous_application['AMT_CREDIT'].isnull().sum())
AMT_CREDIT : 1
previous_application['AMT_CREDIT'].describe()
count 1.670213e+06 mean 1.961140e+05 std 3.185746e+05 min 0.000000e+00 25% 2.416050e+04 50% 8.054100e+04 75% 2.164185e+05 max 6.905160e+06 Name: AMT_CREDIT, dtype: float64
sns.set_style('whitegrid')
sns.boxplot(previous_application['AMT_CREDIT'])
plt.show()
previous_application['PRODUCT_COMBINATION'].value_counts()
PRODUCT_COMBINATION Cash 285990 POS household with interest 263622 POS mobile with interest 220670 Cash X-Sell: middle 143883 Cash X-Sell: low 130248 Card Street 112582 POS industry with interest 98833 POS household without interest 82908 Card X-Sell 80582 Cash Street: high 59639 Cash X-Sell: high 59301 Cash Street: middle 34658 Cash Street: low 33834 POS mobile without interest 24082 POS other with interest 23879 POS industry without interest 12602 POS others without interest 2555 Name: count, dtype: int64
class color:
PURPLE = '\033[95m'
CYAN = '\033[96m'
DARKCYAN = '\033[36m'
BLUE = '\033[94m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
END = '\033[0m'
obj_dtypes = [i for i in previous_application.select_dtypes(include=object).columns if i not in ["type"]]
num_dtypes = [i for i in previous_application.select_dtypes(include=np.number).columns if i not in ['SK_ID_CURR', 'TARGET']]
print(color.BOLD + color.PURPLE + 'Categorical Columns' + color.END, "\n")
for x in range(len(obj_dtypes)):
print(obj_dtypes[x])
Categorical Columns
NAME_CONTRACT_TYPE
WEEKDAY_APPR_PROCESS_START
FLAG_LAST_APPL_PER_CONTRACT
NAME_CASH_LOAN_PURPOSE
NAME_CONTRACT_STATUS
NAME_PAYMENT_TYPE
CODE_REJECT_REASON
NAME_TYPE_SUITE
NAME_CLIENT_TYPE
NAME_GOODS_CATEGORY
NAME_PORTFOLIO
NAME_PRODUCT_TYPE
CHANNEL_TYPE
NAME_SELLER_INDUSTRY
NAME_YIELD_GROUP
PRODUCT_COMBINATION
print(color.BOLD + color.PURPLE + 'Numerical' + color.END, "\n")
for x in range(len(obj_dtypes)):
print(obj_dtypes[x])
Numerical
NAME_CONTRACT_TYPE
WEEKDAY_APPR_PROCESS_START
FLAG_LAST_APPL_PER_CONTRACT
NAME_CASH_LOAN_PURPOSE
NAME_CONTRACT_STATUS
NAME_PAYMENT_TYPE
CODE_REJECT_REASON
NAME_TYPE_SUITE
NAME_CLIENT_TYPE
NAME_GOODS_CATEGORY
NAME_PORTFOLIO
NAME_PRODUCT_TYPE
CHANNEL_TYPE
NAME_SELLER_INDUSTRY
NAME_YIELD_GROUP
PRODUCT_COMBINATION
app.isnull().sum()
SK_ID_CURR 0 TARGET 0 NAME_CONTRACT_TYPE 0 CODE_GENDER 0 FLAG_OWN_CAR 0 FLAG_OWN_REALTY 0 CNT_CHILDREN 0 AMT_INCOME_TOTAL 0 AMT_CREDIT 0 AMT_ANNUITY 12 AMT_GOODS_PRICE 278 NAME_TYPE_SUITE 1292 NAME_INCOME_TYPE 0 NAME_EDUCATION_TYPE 0 NAME_FAMILY_STATUS 0 NAME_HOUSING_TYPE 0 REGION_POPULATION_RELATIVE 0 DAYS_BIRTH 0 DAYS_EMPLOYED 0 DAYS_REGISTRATION 0 DAYS_ID_PUBLISH 0 OWN_CAR_AGE 202929 FLAG_MOBIL 0 FLAG_EMP_PHONE 0 FLAG_WORK_PHONE 0 FLAG_CONT_MOBILE 0 FLAG_PHONE 0 FLAG_EMAIL 0 OCCUPATION_TYPE 96391 CNT_FAM_MEMBERS 2 REGION_RATING_CLIENT 0 REGION_RATING_CLIENT_W_CITY 0 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 REG_REGION_NOT_LIVE_REGION 0 REG_REGION_NOT_WORK_REGION 0 LIVE_REGION_NOT_WORK_REGION 0 REG_CITY_NOT_LIVE_CITY 0 REG_CITY_NOT_WORK_CITY 0 LIVE_CITY_NOT_WORK_CITY 0 ORGANIZATION_TYPE 0 EXT_SOURCE_1 173378 EXT_SOURCE_2 660 EXT_SOURCE_3 60965 APARTMENTS_AVG 156061 BASEMENTAREA_AVG 179943 YEARS_BEGINEXPLUATATION_AVG 150007 YEARS_BUILD_AVG 204488 COMMONAREA_AVG 214865 ELEVATORS_AVG 163891 ENTRANCES_AVG 154828 FLOORSMAX_AVG 153020 FLOORSMIN_AVG 208642 LANDAREA_AVG 182590 LIVINGAPARTMENTS_AVG 210199 LIVINGAREA_AVG 154350 NONLIVINGAPARTMENTS_AVG 213514 NONLIVINGAREA_AVG 169682 APARTMENTS_MODE 156061 BASEMENTAREA_MODE 179943 YEARS_BEGINEXPLUATATION_MODE 150007 YEARS_BUILD_MODE 204488 COMMONAREA_MODE 214865 ELEVATORS_MODE 163891 ENTRANCES_MODE 154828 FLOORSMAX_MODE 153020 FLOORSMIN_MODE 208642 LANDAREA_MODE 182590 LIVINGAPARTMENTS_MODE 210199 LIVINGAREA_MODE 154350 NONLIVINGAPARTMENTS_MODE 213514 NONLIVINGAREA_MODE 169682 APARTMENTS_MEDI 156061 BASEMENTAREA_MEDI 179943 YEARS_BEGINEXPLUATATION_MEDI 150007 YEARS_BUILD_MEDI 204488 COMMONAREA_MEDI 214865 ELEVATORS_MEDI 163891 ENTRANCES_MEDI 154828 FLOORSMAX_MEDI 153020 FLOORSMIN_MEDI 208642 LANDAREA_MEDI 182590 LIVINGAPARTMENTS_MEDI 210199 LIVINGAREA_MEDI 154350 NONLIVINGAPARTMENTS_MEDI 213514 NONLIVINGAREA_MEDI 169682 FONDKAPREMONT_MODE 210295 HOUSETYPE_MODE 154297 TOTALAREA_MODE 148431 WALLSMATERIAL_MODE 156341 EMERGENCYSTATE_MODE 145755 OBS_30_CNT_SOCIAL_CIRCLE 1021 DEF_30_CNT_SOCIAL_CIRCLE 1021 OBS_60_CNT_SOCIAL_CIRCLE 1021 DEF_60_CNT_SOCIAL_CIRCLE 1021 DAYS_LAST_PHONE_CHANGE 1 FLAG_DOCUMENT_2 0 FLAG_DOCUMENT_3 0 FLAG_DOCUMENT_4 0 FLAG_DOCUMENT_5 0 FLAG_DOCUMENT_6 0 FLAG_DOCUMENT_7 0 FLAG_DOCUMENT_8 0 FLAG_DOCUMENT_9 0 FLAG_DOCUMENT_10 0 FLAG_DOCUMENT_11 0 FLAG_DOCUMENT_12 0 FLAG_DOCUMENT_13 0 FLAG_DOCUMENT_14 0 FLAG_DOCUMENT_15 0 FLAG_DOCUMENT_16 0 FLAG_DOCUMENT_17 0 FLAG_DOCUMENT_18 0 FLAG_DOCUMENT_19 0 FLAG_DOCUMENT_20 0 FLAG_DOCUMENT_21 0 AMT_REQ_CREDIT_BUREAU_HOUR 41519 AMT_REQ_CREDIT_BUREAU_DAY 41519 AMT_REQ_CREDIT_BUREAU_WEEK 41519 AMT_REQ_CREDIT_BUREAU_MON 41519 AMT_REQ_CREDIT_BUREAU_QRT 41519 AMT_REQ_CREDIT_BUREAU_YEAR 41519 dtype: int64
Percentage of Missing values in application_data
fig = plt.figure(figsize=(18, 6))
miss_application_data = pd.DataFrame((app.isnull().sum()) * 100 / app.shape[0]).reset_index()
miss_application_data.columns = ["index", "percentage"] # Rename columns for clarity
miss_application_data["type"] = "application_data"
ax = sns.pointplot(x="index", y="percentage", data=miss_application_data, hue="type")
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing Values in application_data")
plt.ylabel("PERCENTAGE")
plt.xlabel("COLUMNS")
ax.set_facecolor("k")
fig.set_facecolor("lightgrey")
plt.show()
round(100*(app.isnull().sum()/len(app.index)),2)
SK_ID_CURR 0.00 TARGET 0.00 NAME_CONTRACT_TYPE 0.00 CODE_GENDER 0.00 FLAG_OWN_CAR 0.00 FLAG_OWN_REALTY 0.00 CNT_CHILDREN 0.00 AMT_INCOME_TOTAL 0.00 AMT_CREDIT 0.00 AMT_ANNUITY 0.00 AMT_GOODS_PRICE 0.09 NAME_TYPE_SUITE 0.42 NAME_INCOME_TYPE 0.00 NAME_EDUCATION_TYPE 0.00 NAME_FAMILY_STATUS 0.00 NAME_HOUSING_TYPE 0.00 REGION_POPULATION_RELATIVE 0.00 DAYS_BIRTH 0.00 DAYS_EMPLOYED 0.00 DAYS_REGISTRATION 0.00 DAYS_ID_PUBLISH 0.00 OWN_CAR_AGE 65.99 FLAG_MOBIL 0.00 FLAG_EMP_PHONE 0.00 FLAG_WORK_PHONE 0.00 FLAG_CONT_MOBILE 0.00 FLAG_PHONE 0.00 FLAG_EMAIL 0.00 OCCUPATION_TYPE 31.35 CNT_FAM_MEMBERS 0.00 REGION_RATING_CLIENT 0.00 REGION_RATING_CLIENT_W_CITY 0.00 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 REG_REGION_NOT_LIVE_REGION 0.00 REG_REGION_NOT_WORK_REGION 0.00 LIVE_REGION_NOT_WORK_REGION 0.00 REG_CITY_NOT_LIVE_CITY 0.00 REG_CITY_NOT_WORK_CITY 0.00 LIVE_CITY_NOT_WORK_CITY 0.00 ORGANIZATION_TYPE 0.00 EXT_SOURCE_1 56.38 EXT_SOURCE_2 0.21 EXT_SOURCE_3 19.83 APARTMENTS_AVG 50.75 BASEMENTAREA_AVG 58.52 YEARS_BEGINEXPLUATATION_AVG 48.78 YEARS_BUILD_AVG 66.50 COMMONAREA_AVG 69.87 ELEVATORS_AVG 53.30 ENTRANCES_AVG 50.35 FLOORSMAX_AVG 49.76 FLOORSMIN_AVG 67.85 LANDAREA_AVG 59.38 LIVINGAPARTMENTS_AVG 68.35 LIVINGAREA_AVG 50.19 NONLIVINGAPARTMENTS_AVG 69.43 NONLIVINGAREA_AVG 55.18 APARTMENTS_MODE 50.75 BASEMENTAREA_MODE 58.52 YEARS_BEGINEXPLUATATION_MODE 48.78 YEARS_BUILD_MODE 66.50 COMMONAREA_MODE 69.87 ELEVATORS_MODE 53.30 ENTRANCES_MODE 50.35 FLOORSMAX_MODE 49.76 FLOORSMIN_MODE 67.85 LANDAREA_MODE 59.38 LIVINGAPARTMENTS_MODE 68.35 LIVINGAREA_MODE 50.19 NONLIVINGAPARTMENTS_MODE 69.43 NONLIVINGAREA_MODE 55.18 APARTMENTS_MEDI 50.75 BASEMENTAREA_MEDI 58.52 YEARS_BEGINEXPLUATATION_MEDI 48.78 YEARS_BUILD_MEDI 66.50 COMMONAREA_MEDI 69.87 ELEVATORS_MEDI 53.30 ENTRANCES_MEDI 50.35 FLOORSMAX_MEDI 49.76 FLOORSMIN_MEDI 67.85 LANDAREA_MEDI 59.38 LIVINGAPARTMENTS_MEDI 68.35 LIVINGAREA_MEDI 50.19 NONLIVINGAPARTMENTS_MEDI 69.43 NONLIVINGAREA_MEDI 55.18 FONDKAPREMONT_MODE 68.39 HOUSETYPE_MODE 50.18 TOTALAREA_MODE 48.27 WALLSMATERIAL_MODE 50.84 EMERGENCYSTATE_MODE 47.40 OBS_30_CNT_SOCIAL_CIRCLE 0.33 DEF_30_CNT_SOCIAL_CIRCLE 0.33 OBS_60_CNT_SOCIAL_CIRCLE 0.33 DEF_60_CNT_SOCIAL_CIRCLE 0.33 DAYS_LAST_PHONE_CHANGE 0.00 FLAG_DOCUMENT_2 0.00 FLAG_DOCUMENT_3 0.00 FLAG_DOCUMENT_4 0.00 FLAG_DOCUMENT_5 0.00 FLAG_DOCUMENT_6 0.00 FLAG_DOCUMENT_7 0.00 FLAG_DOCUMENT_8 0.00 FLAG_DOCUMENT_9 0.00 FLAG_DOCUMENT_10 0.00 FLAG_DOCUMENT_11 0.00 FLAG_DOCUMENT_12 0.00 FLAG_DOCUMENT_13 0.00 FLAG_DOCUMENT_14 0.00 FLAG_DOCUMENT_15 0.00 FLAG_DOCUMENT_16 0.00 FLAG_DOCUMENT_17 0.00 FLAG_DOCUMENT_18 0.00 FLAG_DOCUMENT_19 0.00 FLAG_DOCUMENT_20 0.00 FLAG_DOCUMENT_21 0.00 AMT_REQ_CREDIT_BUREAU_HOUR 13.50 AMT_REQ_CREDIT_BUREAU_DAY 13.50 AMT_REQ_CREDIT_BUREAU_WEEK 13.50 AMT_REQ_CREDIT_BUREAU_MON 13.50 AMT_REQ_CREDIT_BUREAU_QRT 13.50 AMT_REQ_CREDIT_BUREAU_YEAR 13.50 dtype: float64
As per Industrial Standard, max Threshold limit can be between 40% to 50 % depending upon the data acquired in specific sector.
application_data=app.drop([ 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE',
'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',"OWN_CAR_AGE","OCCUPATION_TYPE"],axis=1)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
fig = plt.figure(figsize=(18, 6))
# Calculate the percentage of missing values
miss_application_data = pd.DataFrame((application_data.isnull().sum()) * 100 / application_data.shape[0]).reset_index()
miss_application_data.columns = ["index", "percentage"] # Rename columns for clarity
miss_application_data["type"] = "application_data"
# Correctly specify x and y arguments
ax = sns.pointplot(x="index", y="percentage", data=miss_application_data, hue="type")
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing Values in application_data")
plt.ylabel("PERCENTAGE")
plt.xlabel("COLUMNS")
# Set background colors
ax.set_facecolor("k")
fig.set_facecolor("lightgrey")
plt.show()
round(100*(application_data.isnull().sum()/len(application_data.index)),2)
SK_ID_CURR 0.00 TARGET 0.00 NAME_CONTRACT_TYPE 0.00 CODE_GENDER 0.00 FLAG_OWN_CAR 0.00 FLAG_OWN_REALTY 0.00 CNT_CHILDREN 0.00 AMT_INCOME_TOTAL 0.00 AMT_CREDIT 0.00 AMT_ANNUITY 0.00 AMT_GOODS_PRICE 0.09 NAME_TYPE_SUITE 0.42 NAME_INCOME_TYPE 0.00 NAME_EDUCATION_TYPE 0.00 NAME_FAMILY_STATUS 0.00 NAME_HOUSING_TYPE 0.00 REGION_POPULATION_RELATIVE 0.00 DAYS_BIRTH 0.00 DAYS_EMPLOYED 0.00 DAYS_REGISTRATION 0.00 DAYS_ID_PUBLISH 0.00 FLAG_MOBIL 0.00 FLAG_EMP_PHONE 0.00 FLAG_WORK_PHONE 0.00 FLAG_CONT_MOBILE 0.00 FLAG_PHONE 0.00 FLAG_EMAIL 0.00 CNT_FAM_MEMBERS 0.00 REGION_RATING_CLIENT 0.00 REGION_RATING_CLIENT_W_CITY 0.00 WEEKDAY_APPR_PROCESS_START 0.00 HOUR_APPR_PROCESS_START 0.00 REG_REGION_NOT_LIVE_REGION 0.00 REG_REGION_NOT_WORK_REGION 0.00 LIVE_REGION_NOT_WORK_REGION 0.00 REG_CITY_NOT_LIVE_CITY 0.00 REG_CITY_NOT_WORK_CITY 0.00 LIVE_CITY_NOT_WORK_CITY 0.00 ORGANIZATION_TYPE 0.00 OBS_30_CNT_SOCIAL_CIRCLE 0.33 DEF_30_CNT_SOCIAL_CIRCLE 0.33 OBS_60_CNT_SOCIAL_CIRCLE 0.33 DEF_60_CNT_SOCIAL_CIRCLE 0.33 DAYS_LAST_PHONE_CHANGE 0.00 FLAG_DOCUMENT_2 0.00 FLAG_DOCUMENT_3 0.00 FLAG_DOCUMENT_4 0.00 FLAG_DOCUMENT_5 0.00 FLAG_DOCUMENT_6 0.00 FLAG_DOCUMENT_7 0.00 FLAG_DOCUMENT_8 0.00 FLAG_DOCUMENT_9 0.00 FLAG_DOCUMENT_10 0.00 FLAG_DOCUMENT_11 0.00 FLAG_DOCUMENT_12 0.00 FLAG_DOCUMENT_13 0.00 FLAG_DOCUMENT_14 0.00 FLAG_DOCUMENT_15 0.00 FLAG_DOCUMENT_16 0.00 FLAG_DOCUMENT_17 0.00 FLAG_DOCUMENT_18 0.00 FLAG_DOCUMENT_19 0.00 FLAG_DOCUMENT_20 0.00 FLAG_DOCUMENT_21 0.00 AMT_REQ_CREDIT_BUREAU_HOUR 13.50 AMT_REQ_CREDIT_BUREAU_DAY 13.50 AMT_REQ_CREDIT_BUREAU_WEEK 13.50 AMT_REQ_CREDIT_BUREAU_MON 13.50 AMT_REQ_CREDIT_BUREAU_QRT 13.50 AMT_REQ_CREDIT_BUREAU_YEAR 13.50 dtype: float64
print("AMT_REQ_CREDIT_BUREAU_DAY NAN COUNT :" ,application_data['AMT_REQ_CREDIT_BUREAU_DAY'].isnull().sum())
AMT_REQ_CREDIT_BUREAU_DAY NAN COUNT : 41519
application_data['AMT_REQ_CREDIT_BUREAU_DAY'].describe()
count 265992.000000 mean 0.007000 std 0.110757 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 9.000000 Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
print("AMT_REQ_CREDIT_BUREAU_HOUR NAN COUNT :" ,application_data['AMT_REQ_CREDIT_BUREAU_HOUR'].isnull().sum())
AMT_REQ_CREDIT_BUREAU_HOUR NAN COUNT : 41519
application_data['AMT_REQ_CREDIT_BUREAU_HOUR'].describe()
count 265992.000000 mean 0.006402 std 0.083849 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 4.000000 Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
application_data.isnull().sum()
SK_ID_CURR 0 TARGET 0 NAME_CONTRACT_TYPE 0 CODE_GENDER 0 FLAG_OWN_CAR 0 FLAG_OWN_REALTY 0 CNT_CHILDREN 0 AMT_INCOME_TOTAL 0 AMT_CREDIT 0 AMT_ANNUITY 12 AMT_GOODS_PRICE 278 NAME_TYPE_SUITE 1292 NAME_INCOME_TYPE 0 NAME_EDUCATION_TYPE 0 NAME_FAMILY_STATUS 0 NAME_HOUSING_TYPE 0 REGION_POPULATION_RELATIVE 0 DAYS_BIRTH 0 DAYS_EMPLOYED 0 DAYS_REGISTRATION 0 DAYS_ID_PUBLISH 0 FLAG_MOBIL 0 FLAG_EMP_PHONE 0 FLAG_WORK_PHONE 0 FLAG_CONT_MOBILE 0 FLAG_PHONE 0 FLAG_EMAIL 0 CNT_FAM_MEMBERS 2 REGION_RATING_CLIENT 0 REGION_RATING_CLIENT_W_CITY 0 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 REG_REGION_NOT_LIVE_REGION 0 REG_REGION_NOT_WORK_REGION 0 LIVE_REGION_NOT_WORK_REGION 0 REG_CITY_NOT_LIVE_CITY 0 REG_CITY_NOT_WORK_CITY 0 LIVE_CITY_NOT_WORK_CITY 0 ORGANIZATION_TYPE 0 OBS_30_CNT_SOCIAL_CIRCLE 1021 DEF_30_CNT_SOCIAL_CIRCLE 1021 OBS_60_CNT_SOCIAL_CIRCLE 1021 DEF_60_CNT_SOCIAL_CIRCLE 1021 DAYS_LAST_PHONE_CHANGE 1 FLAG_DOCUMENT_2 0 FLAG_DOCUMENT_3 0 FLAG_DOCUMENT_4 0 FLAG_DOCUMENT_5 0 FLAG_DOCUMENT_6 0 FLAG_DOCUMENT_7 0 FLAG_DOCUMENT_8 0 FLAG_DOCUMENT_9 0 FLAG_DOCUMENT_10 0 FLAG_DOCUMENT_11 0 FLAG_DOCUMENT_12 0 FLAG_DOCUMENT_13 0 FLAG_DOCUMENT_14 0 FLAG_DOCUMENT_15 0 FLAG_DOCUMENT_16 0 FLAG_DOCUMENT_17 0 FLAG_DOCUMENT_18 0 FLAG_DOCUMENT_19 0 FLAG_DOCUMENT_20 0 FLAG_DOCUMENT_21 0 AMT_REQ_CREDIT_BUREAU_HOUR 41519 AMT_REQ_CREDIT_BUREAU_DAY 41519 AMT_REQ_CREDIT_BUREAU_WEEK 41519 AMT_REQ_CREDIT_BUREAU_MON 41519 AMT_REQ_CREDIT_BUREAU_QRT 41519 AMT_REQ_CREDIT_BUREAU_YEAR 41519 dtype: int64
print("AMT_ANNUITY :" ,application_data['AMT_ANNUITY'].isnull().sum())
AMT_ANNUITY : 12
application_data['AMT_ANNUITY'].describe()
count 307499.000000 mean 27108.573909 std 14493.737315 min 1615.500000 25% 16524.000000 50% 24903.000000 75% 34596.000000 max 258025.500000 Name: AMT_ANNUITY, dtype: float64
sns.set_style('whitegrid')
sns.distplot(application_data['AMT_ANNUITY'])
plt.show()
print("AMT_GOODS_PRICE :" ,application_data['AMT_GOODS_PRICE'].isnull().sum())
AMT_GOODS_PRICE : 278
application_data['AMT_GOODS_PRICE'].describe()
count 3.072330e+05 mean 5.383962e+05 std 3.694465e+05 min 4.050000e+04 25% 2.385000e+05 50% 4.500000e+05 75% 6.795000e+05 max 4.050000e+06 Name: AMT_GOODS_PRICE, dtype: float64
sns.set_style('whitegrid')
sns.distplot(application_data['AMT_GOODS_PRICE'])
plt.show()
import statistics
statistics.mode(application_data['DAYS_LAST_PHONE_CHANGE'])
0.0
print(type(application_data.info()))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 70 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_CURR 307511 non-null int64 1 TARGET 307511 non-null int64 2 NAME_CONTRACT_TYPE 307511 non-null object 3 CODE_GENDER 307511 non-null object 4 FLAG_OWN_CAR 307511 non-null object 5 FLAG_OWN_REALTY 307511 non-null object 6 CNT_CHILDREN 307511 non-null int64 7 AMT_INCOME_TOTAL 307511 non-null float64 8 AMT_CREDIT 307511 non-null float64 9 AMT_ANNUITY 307499 non-null float64 10 AMT_GOODS_PRICE 307233 non-null float64 11 NAME_TYPE_SUITE 306219 non-null object 12 NAME_INCOME_TYPE 307511 non-null object 13 NAME_EDUCATION_TYPE 307511 non-null object 14 NAME_FAMILY_STATUS 307511 non-null object 15 NAME_HOUSING_TYPE 307511 non-null object 16 REGION_POPULATION_RELATIVE 307511 non-null float64 17 DAYS_BIRTH 307511 non-null int64 18 DAYS_EMPLOYED 307511 non-null int64 19 DAYS_REGISTRATION 307511 non-null float64 20 DAYS_ID_PUBLISH 307511 non-null int64 21 FLAG_MOBIL 307511 non-null int64 22 FLAG_EMP_PHONE 307511 non-null int64 23 FLAG_WORK_PHONE 307511 non-null int64 24 FLAG_CONT_MOBILE 307511 non-null int64 25 FLAG_PHONE 307511 non-null int64 26 FLAG_EMAIL 307511 non-null int64 27 CNT_FAM_MEMBERS 307509 non-null float64 28 REGION_RATING_CLIENT 307511 non-null int64 29 REGION_RATING_CLIENT_W_CITY 307511 non-null int64 30 WEEKDAY_APPR_PROCESS_START 307511 non-null object 31 HOUR_APPR_PROCESS_START 307511 non-null int64 32 REG_REGION_NOT_LIVE_REGION 307511 non-null int64 33 REG_REGION_NOT_WORK_REGION 307511 non-null int64 34 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64 35 REG_CITY_NOT_LIVE_CITY 307511 non-null int64 36 REG_CITY_NOT_WORK_CITY 307511 non-null int64 37 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64 38 ORGANIZATION_TYPE 307511 non-null object 39 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 40 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 41 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 42 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 43 DAYS_LAST_PHONE_CHANGE 307510 non-null float64 44 FLAG_DOCUMENT_2 307511 non-null int64 45 FLAG_DOCUMENT_3 307511 non-null int64 46 FLAG_DOCUMENT_4 307511 non-null int64 47 FLAG_DOCUMENT_5 307511 non-null int64 48 FLAG_DOCUMENT_6 307511 non-null int64 49 FLAG_DOCUMENT_7 307511 non-null int64 50 FLAG_DOCUMENT_8 307511 non-null int64 51 FLAG_DOCUMENT_9 307511 non-null int64 52 FLAG_DOCUMENT_10 307511 non-null int64 53 FLAG_DOCUMENT_11 307511 non-null int64 54 FLAG_DOCUMENT_12 307511 non-null int64 55 FLAG_DOCUMENT_13 307511 non-null int64 56 FLAG_DOCUMENT_14 307511 non-null int64 57 FLAG_DOCUMENT_15 307511 non-null int64 58 FLAG_DOCUMENT_16 307511 non-null int64 59 FLAG_DOCUMENT_17 307511 non-null int64 60 FLAG_DOCUMENT_18 307511 non-null int64 61 FLAG_DOCUMENT_19 307511 non-null int64 62 FLAG_DOCUMENT_20 307511 non-null int64 63 FLAG_DOCUMENT_21 307511 non-null int64 64 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64 65 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64 66 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64 67 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64 68 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64 69 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64 dtypes: float64(18), int64(41), object(11) memory usage: 164.2+ MB <class 'NoneType'>
to make the data more intuitive, interpretable, and easier to work with in subsequent analyses. It ensures that any downstream analysis or machine learning models are not confused by negative values that are purely a result of how the data was originally encoded.
application_data['DAYS_BIRTH'] = abs(application_data['DAYS_BIRTH'])
application_data['DAYS_ID_PUBLISH'] = abs(application_data['DAYS_ID_PUBLISH'])
application_data['DAYS_ID_PUBLISH'] = abs(application_data['DAYS_ID_PUBLISH'])
application_data['DAYS_LAST_PHONE_CHANGE'] = abs(application_data['DAYS_LAST_PHONE_CHANGE'])
display("application_data")
display(application_data.head())
'application_data'
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | 9461 | -637 | -3648.0 | 2120 | 1 | 1 | 0 | 1 | 1 | 0 | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 2.0 | 2.0 | 2.0 | 1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | -1188 | -1186.0 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | 19046 | -225 | -4260.0 | 2531 | 1 | 1 | 1 | 1 | 1 | 0 | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.0 | 0.0 | 0.0 | 0.0 | 815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | 19005 | -3039 | -9833.0 | 2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | 617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | 19932 | -3038 | -4311.0 | 3458 | 1 | 1 | 0 | 1 | 0 | 0 | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.0 | 0.0 | 0.0 | 0.0 | 1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Separating numerical and categorical in application_data
obj_dtypes = [i for i in application_data.select_dtypes(include=object).columns if i not in ["type"]]
num_dtypes = [i for i in application_data.select_dtypes(include=np.number).columns if i not in ['SK_ID_CURR', 'TARGET']]
print(color.BOLD + color.PURPLE + 'Categorical Columns' + color.END, "\n")
for x in range(len(obj_dtypes)):
print(obj_dtypes[x])
Categorical Columns
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_TYPE_SUITE
NAME_INCOME_TYPE
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
NAME_HOUSING_TYPE
WEEKDAY_APPR_PROCESS_START
ORGANIZATION_TYPE
print(color.BOLD + color.PURPLE +"Numerical Columns" + color.END, "\n")
for x in range(len(num_dtypes)):
print(num_dtypes[x])
Numerical Columns
CNT_CHILDREN
AMT_INCOME_TOTAL
AMT_CREDIT
AMT_ANNUITY
AMT_GOODS_PRICE
REGION_POPULATION_RELATIVE
DAYS_BIRTH
DAYS_EMPLOYED
DAYS_REGISTRATION
DAYS_ID_PUBLISH
FLAG_MOBIL
FLAG_EMP_PHONE
FLAG_WORK_PHONE
FLAG_CONT_MOBILE
FLAG_PHONE
FLAG_EMAIL
CNT_FAM_MEMBERS
REGION_RATING_CLIENT
REGION_RATING_CLIENT_W_CITY
HOUR_APPR_PROCESS_START
REG_REGION_NOT_LIVE_REGION
REG_REGION_NOT_WORK_REGION
LIVE_REGION_NOT_WORK_REGION
REG_CITY_NOT_LIVE_CITY
REG_CITY_NOT_WORK_CITY
LIVE_CITY_NOT_WORK_CITY
OBS_30_CNT_SOCIAL_CIRCLE
DEF_30_CNT_SOCIAL_CIRCLE
OBS_60_CNT_SOCIAL_CIRCLE
DEF_60_CNT_SOCIAL_CIRCLE
DAYS_LAST_PHONE_CHANGE
FLAG_DOCUMENT_2
FLAG_DOCUMENT_3
FLAG_DOCUMENT_4
FLAG_DOCUMENT_5
FLAG_DOCUMENT_6
FLAG_DOCUMENT_7
FLAG_DOCUMENT_8
FLAG_DOCUMENT_9
FLAG_DOCUMENT_10
FLAG_DOCUMENT_11
FLAG_DOCUMENT_12
FLAG_DOCUMENT_13
FLAG_DOCUMENT_14
FLAG_DOCUMENT_15
FLAG_DOCUMENT_16
FLAG_DOCUMENT_17
FLAG_DOCUMENT_18
FLAG_DOCUMENT_19
FLAG_DOCUMENT_20
FLAG_DOCUMENT_21
AMT_REQ_CREDIT_BUREAU_HOUR
AMT_REQ_CREDIT_BUREAU_DAY
AMT_REQ_CREDIT_BUREAU_WEEK
AMT_REQ_CREDIT_BUREAU_MON
AMT_REQ_CREDIT_BUREAU_QRT
AMT_REQ_CREDIT_BUREAU_YEAR
Imbalance percentage
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
application_data["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["red","yellow"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("Distribution of gender")
plt.show()
It's non balanced data
TARGET :Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in sample, 0 - all other cases)
plt.figure(figsize=(14,7))
plt.subplot(121)
application_data["TARGET"].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",7),startangle = 60,labels=["repayer","defaulter"],
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.1,0],shadow =True)
plt.title("Distribution of target variable")
plt.subplot(122)
ax = application_data["TARGET"].value_counts().plot(kind="barh")
for i,j in enumerate(application_data["TARGET"].value_counts().values):
ax.text(.7,i,j,weight = "bold",fontsize=20)
plt.title("Count of target variable")
plt.show()
8% out of total client population have difficulties in repaying loans.
application_data_x = application_data[[x for x in application_data.columns if x not in ["TARGET"]]]
previous_application_x = previous_application[[x for x in previous_application.columns if x not in ["TARGET"]]]
application_data_x["type"] = "application_data"
previous_application_x["type"] = "previous_application"
data = pd.concat([application_data_x,previous_application_x],axis=0)
NAME_CONTRACT_TYPE : Identification if loan is cash , consumer or revolving
plt.figure(figsize=(14,7))
plt.subplot(121)
data[data["type"] == "application_data"]["NAME_CONTRACT_TYPE"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["orange","red"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.title("distribution of contract types in application_data")
plt.subplot(122)
data[data["type"] == "previous_application"]["NAME_CONTRACT_TYPE"].value_counts().plot.pie(autopct = "%1.2f%%",colors = ["red","yellow","green",'BLACK'],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.ylabel("")
plt.title("distribution of contract types in previous_application")
plt.show()
plt.show()
The percentage of revolving loans and cash loans are 10% & 90%.
Gender Distribution in application_data
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
data[data["type"] == "application_data"]["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["red","yellow"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("distribution of gender in application_data")
plt.show()
Distribution of Contract type by gender
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(13, 6))
plt.subplot(121)
ax = sns.countplot(x="NAME_CONTRACT_TYPE", hue="CODE_GENDER", data=data[data["type"] == "application_data"], palette=["r", "b", "g"])
ax.set_facecolor("lightgrey")
ax.set_title("Distribution of Contract Type by Gender - application_data")
plt.show()
Cash loans is always prefered over Revolving loans by both genders
Distribution of client owning a car and by gender
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
data["FLAG_OWN_CAR"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["gold","orangered"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0],shadow =True)
plt.title("distribution of client owning a car")
plt.subplot(122)
data[data["FLAG_OWN_CAR"] == "Y"]["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["b","orangered"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("distribution of client owning a car by gender")
plt.show()
SUBPLOT 1 : Distribution of client owning a car. 34% of clients own a car .
SUBPLOT 2 : Distribution of client owning a car by gender. Out of total clients who own car 57% are male and 43% are female
Distribution of client owning a house or flat and by gender
plt.figure(figsize=(13,6))
plt.subplot(121)
data["FLAG_OWN_REALTY"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["skyblue","gold"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[0.05,0],shadow =True)
plt.title("Distribution of client owns a house or flat")
plt.subplot(122)
data[data["FLAG_OWN_REALTY"] == "Y"]["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["orangered","b"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("Distribution of client owning a house or flat by gender")
plt.show()
SUBPLOT 1 : Distribution of client owning a house or flat . 69% of clients own a flat or house .
SUBPLOT 2 : Distribution of client owning a house or flat by gender . Out of total clients who own house 67% are female and 33% are male.
Distribution of Number of children and family members of client by repayment status.
fig = plt.figure(figsize=(12, 10))
# Plot 1: Distribution of Number of Children by Repayment Status
plt.subplot(211)
sns.countplot(x="CNT_CHILDREN", hue="TARGET", data=application_data, palette="Set1")
plt.legend(loc="upper center")
plt.title("Distribution of Number of Children Client Has by Repayment Status")
# Plot 2: Distribution of Number of Family Members by Repayment Status
plt.subplot(212)
sns.countplot(x="CNT_FAM_MEMBERS", hue="TARGET", data=application_data, palette="Set1")
plt.legend(loc="upper center")
plt.title("Distribution of Number of Family Members Client Has by Repayment Status")
# Set figure background color
fig.set_facecolor("lightblue")
plt.tight_layout()
plt.show()
Distribution of contract type ,gender ,own car ,own house with respect to Repayment status(Target variable)
default = application_data[application_data["TARGET"]==1][[ 'NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']]
non_default = application_data[application_data["TARGET"]==0][[ 'NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']]
d_cols = ['NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
d_length = len(d_cols)
fig = plt.figure(figsize=(16,4))
for i,j in itertools.zip_longest(d_cols,range(d_length)):
plt.subplot(1,4,j+1)
default[i].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism"),startangle = 90,
wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.ylabel("")
plt.title(i+"-Defaulter")
fig = plt.figure(figsize=(16,4))
for i,j in itertools.zip_longest(d_cols,range(d_length)):
plt.subplot(1,4,j+1)
non_default[i].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",3),startangle = 90,
wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.ylabel("")
plt.title(i+"-Repayer")
Percentage of males is 10% more in defaults than non defaulters.
Percentage of Cash Loans is 4% more in defaults than Revolving Loans.
AMT_INCOME_TOTAL - Income of the client
AMT_CREDIT - Credit amount of the loan
AMT_ANNUITY - Loan annuity
AMT_GOODS_PRICE - For consumer loans it is the price of the goods for which the loan is given
cols = [ 'AMT_INCOME_TOTAL', 'AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE']
length = len(cols)
cs = ["r","b","g","k"]
ax = plt.figure(figsize=(18,18))
ax.set_facecolor("lightgrey")
for i,j,k in itertools.zip_longest(cols,range(length),cs):
plt.subplot(2,2,j+1)
sns.distplot(data[data[i].notnull()][i],color=k)
plt.axvline(data[i].mean(),label = "mean",linestyle="dashed",color="k")
plt.legend(loc="best")
plt.title(i)
plt.subplots_adjust(hspace = .2)
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
stat = df_new["statistic"].unique().tolist()
length = len(stat)
plt.figure(figsize=(13, 15))
# Loop through each statistic and create a subplot
for i, j in itertools.zip_longest(stat, range(length)):
plt.subplot(2, 2, j + 1)
# Use keyword arguments for x, y, and hue
sns.barplot(x="amount_type", y="amount", hue="type",
data=df_new[df_new["statistic"] == i], palette=["g", "r"])
plt.title(i + " -- Defaulters vs Non-defaulters")
plt.subplots_adjust(hspace=0.4)
# Set the background color of each figure
fig = plt.gcf()
fig.set_facecolor("lightgrey")
plt.show()
1 . Average income of clients who default and who do not are almost same.
2 . Standard deviation in income of client who default is very high compared to who do not default.
3 . Clients who default also has maximum income earnings
Credit amount of the loan ,Loan annuity,Amount goods price - 1 . Statistics between credit amounts,Loan annuity and Amount goods price given to cilents who default and who dont are almost similar.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Plotting
plt.figure(figsize=(12, 5))
ax = sns.barplot(x="amt_type", y="amount", data=df_gen, hue="gender", palette="Set1")
plt.title("Average Income, Credit, Annuity & Goods Price by Gender")
plt.xticks(rotation=45)
plt.show()
Scatter plot between credit amount and annuity amount
fig = plt.figure(figsize=(10,8))
plt.scatter(application_data[application_data["TARGET"]==0]['AMT_ANNUITY'],application_data[application_data["TARGET"]==0]['AMT_CREDIT'],s=35,
color="b",alpha=.5,label="REPAYER",linewidth=.5,edgecolor="k")
plt.scatter(application_data[application_data["TARGET"]==1]['AMT_ANNUITY'],application_data[application_data["TARGET"]==1]['AMT_CREDIT'],s=35,
color="r",alpha=.2,label="DEFAULTER",linewidth=.5,edgecolor="k")
plt.legend(loc="best",prop={"size":15})
plt.xlabel("AMT_ANNUITY")
plt.ylabel("AMT_CREDIT")
plt.title("Scatter plot between credit amount and annuity amount")
plt.show()
AMT_INCOME_TOTAL - Income of the client
AMT_CREDIT - Credit amount of the loan
AMT_ANNUITY - Loan annuity
AMT_GOODS_PRICE - For consumer loans it is the price of the goods for which the loan is given
amt = application_data[[ 'AMT_INCOME_TOTAL','AMT_CREDIT',
'AMT_ANNUITY', 'AMT_GOODS_PRICE',"TARGET"]]
amt = amt[(amt["AMT_GOODS_PRICE"].notnull()) & (amt["AMT_ANNUITY"].notnull())]
sns.pairplot(amt,hue="TARGET",palette=["b","r"])
plt.show()
NAME_TYPE_SUITE - Who was accompanying client when he was applying for the loan.
plt.figure(figsize=(18,12))
plt.subplot(121)
sns.countplot(y=data["NAME_TYPE_SUITE"],
palette="Set2",
order=data["NAME_TYPE_SUITE"].value_counts().index[:5])
plt.title("Distribution of Suite type")
plt.subplot(122)
sns.countplot(y=data["NAME_TYPE_SUITE"],
hue=data["CODE_GENDER"],palette="Set2",
order=data["NAME_TYPE_SUITE"].value_counts().index[:5])
plt.ylabel("")
plt.title("Distribution of Suite type by gender")
plt.subplots_adjust(wspace = .4)
NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave,…)
plt.figure(figsize=(18,12))
plt.subplot(121)
sns.countplot(y=data["NAME_INCOME_TYPE"],
palette="Set2",
order=data["NAME_INCOME_TYPE"].value_counts().index[:4])
plt.title("Distribution of client income type")
plt.subplot(122)
sns.countplot(y=data["NAME_INCOME_TYPE"],
hue=data["CODE_GENDER"],
palette="Set2",
order=data["NAME_INCOME_TYPE"].value_counts().index[:4])
plt.ylabel("")
plt.title("Distribution of client income type by gender")
plt.subplots_adjust(wspace = .4)
cols = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE', 'DAYS_TERMINATION']
plt.figure(figsize=(12,6))
sns.heatmap(previous_application[cols].describe()[1:].transpose(),
annot=True,linewidth=2,linecolor="k",cmap=sns.color_palette("inferno"))
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
# Select only numeric columns for correlation calculation
numeric_cols = application_data.select_dtypes(include='number')
# Calculate the correlation matrix
corrmat = numeric_cols.corr()
# Plot the heatmap
f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(corrmat, ax=ax, cmap="rainbow", annot=True, fmt=".2f")
plt.title("Correlation Matrix - Application Data")
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
# Select only numeric columns for correlation calculation
numeric_cols = previous_application.select_dtypes(include='number')
# Calculate the correlation matrix
corrmat = numeric_cols.corr()
# Plot the heatmap
f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(corrmat, ax=ax, cmap="rainbow", annot=True, fmt=".2f")
plt.title("Correlation Matrix - Previous Application Data")
plt.show()
import numpy as np
# Select only numeric columns
numeric_cols = previous_application.select_dtypes(include='number')
# Calculate the correlation matrix for numeric columns
corrmat = numeric_cols.corr()
# Create a DataFrame to store the correlation values above the diagonal
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset the index
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop NaN values (which are not correlations)
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values and convert to absolute values
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort the correlations by their absolute values in descending order
top_correlations = corrdf.sort_values(by='Correlation', ascending=False).head(10)
# Display the top 10 correlations
print(top_correlations)
Var1 Var2 Correlation 88 AMT_GOODS_PRICE AMT_APPLICATION 1.00 89 AMT_GOODS_PRICE AMT_CREDIT 0.99 71 AMT_CREDIT AMT_APPLICATION 0.98 269 DAYS_TERMINATION DAYS_LAST_DUE 0.93 87 AMT_GOODS_PRICE AMT_ANNUITY 0.82 70 AMT_CREDIT AMT_ANNUITY 0.82 53 AMT_APPLICATION AMT_ANNUITY 0.81 232 DAYS_LAST_DUE_1ST_VERSION DAYS_FIRST_DRAWING 0.80 173 CNT_PAYMENT AMT_APPLICATION 0.68 174 CNT_PAYMENT AMT_CREDIT 0.67
df_repayer = application_data[application_data['TARGET'] == 0]
df_defaulter = application_data[application_data['TARGET'] == 1]
import numpy as np
# Select only numeric columns from df_repayer
numeric_cols_repayer = df_repayer.select_dtypes(include='number')
# Calculate the correlation matrix for numeric columns
corrmat = numeric_cols_repayer.corr()
# Create a DataFrame to store the correlation values above the diagonal
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset the index
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop NaN values (which are not correlations)
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values and convert to absolute values
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort the correlations by their absolute values in descending order
top_correlations_repayer = corrdf.sort_values(by='Correlation', ascending=False).head(10)
# Display the top 10 correlations
print(top_correlations_repayer)
Var1 Var2 Correlation 776 FLAG_EMP_PHONE DAYS_EMPLOYED 1.00 1798 OBS_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE 1.00 358 AMT_GOODS_PRICE AMT_CREDIT 0.99 1199 REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.95 1064 CNT_FAM_MEMBERS CNT_CHILDREN 0.88 1858 DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.86 1439 LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.86 1619 LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.83 359 AMT_GOODS_PRICE AMT_ANNUITY 0.78 299 AMT_ANNUITY AMT_CREDIT 0.77
Top 10 Correlation Fields for Defaulter
import numpy as np
# Select only numeric columns from df_defaulter
numeric_cols_defaulter = df_defaulter.select_dtypes(include='number')
# Calculate the correlation matrix for numeric columns
corrmat = numeric_cols_defaulter.corr()
# Create a DataFrame to store the correlation values above the diagonal
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset the index
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop NaN values (which are not correlations)
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values and convert to absolute values
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort the correlations by their absolute values in descending order
top_correlations_defaulter = corrdf.sort_values(by='Correlation', ascending=False).head(10)
# Display the top 10 correlations
print(top_correlations_defaulter)
Var1 Var2 Correlation 1798 OBS_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE 1.00 776 FLAG_EMP_PHONE DAYS_EMPLOYED 1.00 358 AMT_GOODS_PRICE AMT_CREDIT 0.98 1199 REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.96 1064 CNT_FAM_MEMBERS CNT_CHILDREN 0.89 1858 DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.87 1439 LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.85 1619 LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.78 299 AMT_ANNUITY AMT_CREDIT 0.75 359 AMT_GOODS_PRICE AMT_ANNUITY 0.75
mergeddf = pd.merge(application_data,previous_application,on='SK_ID_CURR')
mergeddf.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | NAME_TYPE_SUITE_x | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START_x | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | SK_ID_PREV | NAME_CONTRACT_TYPE_y | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | WEEKDAY_APPR_PROCESS_START_y | HOUR_APPR_PROCESS_START_y | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_y | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | 9461 | -637 | -3648.0 | 2120 | 1 | 1 | 0 | 1 | 1 | 0 | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 2.0 | 2.0 | 2.0 | 1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1038818 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 179055.0 | SATURDAY | 9 | Y | 1 | XAP | Approved | -606 | XNA | XAP | NaN | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | -1188 | -1186.0 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1810518 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | 900000.0 | FRIDAY | 12 | Y | 1 | XNA | Approved | -746 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | -1188 | -1186.0 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2636178 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 337500.0 | SUNDAY | 17 | Y | 1 | XAP | Approved | -828 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | 16765 | -1188 | -1186.0 | 291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2396755 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 68809.5 | SATURDAY | 15 | Y | 1 | XAP | Approved | -2341 | Cash through the bank | XAP | Family | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200 | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | 19046 | -225 | -4260.0 | 2531 | 1 | 1 | 1 | 1 | 1 | 0 | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.0 | 0.0 | 0.0 | 0.0 | 815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1564014 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 24282.0 | FRIDAY | 5 | Y | 1 | XAP | Approved | -815 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
y = mergeddf.groupby('SK_ID_CURR').size()
dfA = mergeddf.groupby('SK_ID_CURR').agg({'TARGET': np.sum})
dfA['count'] = y
display(dfA.head(10))
| TARGET | count | |
|---|---|---|
| SK_ID_CURR | ||
| 100002 | 1 | 1 |
| 100003 | 0 | 3 |
| 100004 | 0 | 1 |
| 100006 | 0 | 9 |
| 100007 | 0 | 6 |
| 100008 | 0 | 5 |
| 100009 | 0 | 7 |
| 100010 | 0 | 1 |
| 100011 | 0 | 4 |
| 100012 | 0 | 4 |
dfA.sort_values(by = 'count',ascending=False).head(10)
| TARGET | count | |
|---|---|---|
| SK_ID_CURR | ||
| 265681 | 0 | 73 |
| 173680 | 0 | 72 |
| 242412 | 0 | 68 |
| 206783 | 0 | 67 |
| 389950 | 0 | 64 |
| 382179 | 0 | 64 |
| 198355 | 0 | 63 |
| 345161 | 0 | 62 |
| 446486 | 0 | 62 |
| 238250 | 0 | 61 |
df_repayer = dfA[dfA['TARGET'] == 0]
df_defaulter = dfA[dfA['TARGET'] == 1]
Repayers' Borrowing History
df_repayer.sort_values(by = 'count',ascending=False).head(10)
| TARGET | count | |
|---|---|---|
| SK_ID_CURR | ||
| 265681 | 0 | 73 |
| 173680 | 0 | 72 |
| 242412 | 0 | 68 |
| 206783 | 0 | 67 |
| 382179 | 0 | 64 |
| 389950 | 0 | 64 |
| 198355 | 0 | 63 |
| 446486 | 0 | 62 |
| 345161 | 0 | 62 |
| 280586 | 0 | 61 |
df_defaulter.sort_values(by = 'count',ascending=False).head(10)
| TARGET | count | |
|---|---|---|
| SK_ID_CURR | ||
| 100002 | 1 | 1 |
| 333349 | 1 | 1 |
| 333587 | 1 | 1 |
| 333582 | 1 | 1 |
| 333534 | 1 | 1 |
| 333506 | 1 | 1 |
| 333419 | 1 | 1 |
| 333355 | 1 | 1 |
| 333337 | 1 | 1 |
| 334761 | 1 | 1 |
application_data=pd.read_csv("application_data.csv")
previous_application=pd.read_csv("previous_application.csv")
columns_description=pd.read_csv("columns_description.csv",encoding='ISO-8859-1')
# The file I'm trying to read might not actually be encoded in utf-8. It could be using another encoding like ISO-8859-1
application_data=application_data.drop([ 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE',
'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',"OWN_CAR_AGE","OCCUPATION_TYPE"],axis=1)
previous_application=previous_application.drop([ 'AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
"RATE_INTEREST_PRIVILEGED"],axis=1)
pd.set_option("display.max_rows", None, "display.max_columns", None)
combined_df= pd.merge(application_data, previous_application, on='SK_ID_CURR', how='inner')
combined_df.sort_values(by=['SK_ID_CURR','SK_ID_PREV'],ascending=[True,True],inplace=True)
display(combined_df.head(10))
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | NAME_TYPE_SUITE_x | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START_x | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | SK_ID_PREV | NAME_CONTRACT_TYPE_y | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | WEEKDAY_APPR_PROCESS_START_y | HOUR_APPR_PROCESS_START_y | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_y | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | 1 | 1 | 0 | 1 | 1 | 0 | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1038818 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 179055.0 | SATURDAY | 9 | Y | 1 | XAP | Approved | -606 | XNA | XAP | NaN | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1810518 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | 900000.0 | FRIDAY | 12 | Y | 1 | XNA | Approved | -746 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2396755 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 68809.5 | SATURDAY | 15 | Y | 1 | XAP | Approved | -2341 | Cash through the bank | XAP | Family | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200 | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2636178 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 337500.0 | SUNDAY | 17 | Y | 1 | XAP | Approved | -828 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 1 | 1 | 1 | 1 | 1 | 0 | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1564014 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 24282.0 | FRIDAY | 5 | Y | 1 | XAP | Approved | -815 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
| 9 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1020698 | Cash loans | 39954.510 | 454500.0 | 481495.5 | 454500.0 | SATURDAY | 12 | Y | 1 | XNA | Approved | -438 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 18.0 | high | Cash X-Sell: high | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1243599 | Cash loans | NaN | 0.0 | 0.0 | NaN | THURSDAY | 15 | Y | 1 | XNA | Canceled | -181 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1489396 | Revolving loans | 13500.000 | 270000.0 | 270000.0 | 270000.0 | THURSDAY | 15 | Y | 1 | XAP | Approved | -181 | XNA | XAP | Unaccompanied | Repeater | XNA | Cards | x-sell | Credit and cash offices | -1 | XNA | 0.0 | XNA | Card X-Sell | 365243.0 | 365243.0 | 365243.0 | 365243.0 | 365243.0 | 0.0 |
| 13 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1697039 | Cash loans | 32696.100 | 688500.0 | 906615.0 | 688500.0 | THURSDAY | 15 | Y | 1 | XNA | Refused | -181 | Cash through the bank | LIMIT | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 48.0 | low_normal | Cash X-Sell: low | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2078043 | Cash loans | 24246.000 | 675000.0 | 675000.0 | 675000.0 | THURSDAY | 15 | Y | 1 | XNA | Approved | -181 | Cash through the bank | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 48.0 | low_normal | Cash X-Sell: low | 365243.0 | -151.0 | 1259.0 | -151.0 | -143.0 | 0.0 |
combined_df.drop_duplicates(subset='SK_ID_CURR',inplace=True,keep='last')
combined_df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | NAME_TYPE_SUITE_x | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START_x | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | SK_ID_PREV | NAME_CONTRACT_TYPE_y | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | WEEKDAY_APPR_PROCESS_START_y | HOUR_APPR_PROCESS_START_y | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_y | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | 1 | 1 | 0 | 1 | 1 | 0 | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1038818 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 179055.0 | SATURDAY | 9 | Y | 1 | XAP | Approved | -606 | XNA | XAP | NaN | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2636178 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 337500.0 | SUNDAY | 17 | Y | 1 | XAP | Approved | -828 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 1 | 1 | 1 | 1 | 1 | 0 | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1564014 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 24282.0 | FRIDAY | 5 | Y | 1 | XAP | Approved | -815 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
| 6 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2827850 | Revolving loans | NaN | 0.0 | 0.0 | NaN | THURSDAY | 15 | Y | 1 | XAP | Canceled | -181 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Card Street | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | 1 | 1 | 0 | 1 | 0 | 0 | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2730157 | Cash loans | 13010.985 | 225000.0 | 284400.0 | 225000.0 | FRIDAY | 13 | Y | 1 | XNA | Approved | -867 | Cash through the bank | XAP | Family | Repeater | XNA | Cash | x-sell | AP+ (Cash loan) | 4 | XNA | 48.0 | middle | Cash X-Sell: middle | NaN | NaN | NaN | NaN | NaN | NaN |
combined_df['DAYS_BIRTH'] = abs(combined_df['DAYS_BIRTH'])
combined_df['DAYS_ID_PUBLISH'] = abs(combined_df['DAYS_ID_PUBLISH'])
combined_df['DAYS_ID_PUBLISH'] = abs(combined_df['DAYS_ID_PUBLISH'])
combined_df['DAYS_LAST_PHONE_CHANGE'] = abs(combined_df['DAYS_LAST_PHONE_CHANGE'])
# Select only numeric columns from the combined_df
numeric_cols_combined = combined_df.select_dtypes(include='number')
# Calculate the correlation matrix for numeric columns
corrmat = numeric_cols_combined.corr()
# Create a heatmap for the correlation matrix
f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(corrmat, ax=ax, cmap="rainbow")
plt.show()
Top 10 Correlation Fields
# Select only the numeric columns in the DataFrame
numeric_cols_combined = combined_df.select_dtypes(include='number')
# Compute the correlation matrix for numeric columns only
corrmat = numeric_cols_combined.corr()
# Apply the upper triangle mask to avoid duplicate correlations
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset index
corrdf = corrdf.unstack().reset_index()
# Rename columns for clarity
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop rows with NaN values (those from the lower triangle)
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values to two decimal places and take the absolute value
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort by the correlation values in descending order and get the top 10
top_10_corr = corrdf.sort_values(by='Correlation', ascending=False).head(10)
# Display the top 10 correlations
top_10_corr
| Var1 | Var2 | Correlation | |
|---|---|---|---|
| 2278 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 1.00 |
| 4786 | AMT_GOODS_PRICE_y | AMT_APPLICATION | 1.00 |
| 984 | FLAG_EMP_PHONE | DAYS_EMPLOYED | 1.00 |
| 4787 | AMT_GOODS_PRICE_y | AMT_CREDIT_y | 0.99 |
| 454 | AMT_GOODS_PRICE_x | AMT_CREDIT_x | 0.99 |
| 4711 | AMT_CREDIT_y | AMT_APPLICATION | 0.97 |
| 1519 | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.95 |
| 5547 | DAYS_TERMINATION | DAYS_LAST_DUE | 0.93 |
| 1352 | CNT_FAM_MEMBERS | CNT_CHILDREN | 0.88 |
| 1823 | LIVE_REGION_NOT_WORK_REGION | REG_REGION_NOT_WORK_REGION | 0.87 |
Imbalace percentage
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
combined_df["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["red","yellow"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("Distribution of gender")
plt.show()
Distribution of Target variable
plt.figure(figsize=(14,7))
plt.subplot(121)
combined_df["TARGET"].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",7),startangle = 60,labels=["repayer","defaulter"],
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.1,0],shadow =True)
plt.title("Distribution of target variable")
plt.subplot(122)
ax = combined_df["TARGET"].value_counts().plot(kind="barh")
for i,j in enumerate(combined_df["TARGET"].value_counts().values):
ax.text(.7,i,j,weight = "bold",fontsize=20)
plt.title("Count of target variable")
plt.show()
Distribution in Contract types
plt.figure(figsize=(14,7))
plt.subplot(121)
combined_df["NAME_CONTRACT_TYPE_x"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["orange","red"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.title("distribution of contract types in combined_df")
plt.show()
Gender Distribution
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
combined_df["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["red","yellow"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("Distribution of gender")
plt.show()
Distribution of Contract type by gender
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
ax = sns.countplot(x="NAME_CONTRACT_TYPE_x", hue="CODE_GENDER", data=combined_df, palette=["r", "b", "g"])
ax.set_facecolor("lightgrey")
ax.set_title("Distribution of Contract type by gender - application_data")
plt.show()
Distribution of client owning a car and by gender
fig = plt.figure(figsize=(13,6))
plt.subplot(121)
combined_df["FLAG_OWN_CAR"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["gold","orangered"],startangle = 60,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0],shadow =True)
plt.title("distribution of client owning a car")
plt.subplot(122)
combined_df[combined_df["FLAG_OWN_CAR"] == "Y"]["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["b","orangered"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("distribution of client owning a car by gender")
plt.show()
Distribution of client owning a house or flat and by gender
plt.figure(figsize=(13,6))
plt.subplot(121)
combined_df["FLAG_OWN_REALTY"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["skyblue","gold"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[0.05,0],shadow =True)
plt.title("Distribution of client owns a house or flat")
plt.subplot(122)
combined_df[combined_df["FLAG_OWN_REALTY"] == "Y"]["CODE_GENDER"].value_counts().plot.pie(autopct = "%1.0f%%",colors = ["orangered","b"],startangle = 90,
wedgeprops={"linewidth":2,"edgecolor":"k"},explode=[.05,0,0],shadow =True)
plt.title("Distribution of client owning a house or flat by gender")
plt.show()
Distribution of Number of children and family members of client by repayment status.
CNT_CHILDREN - Number of children the client has.
CNT_FAM_MEMBERS - How many family members does client have.
fig = plt.figure(figsize=(12,10))
# First subplot
plt.subplot(211)
sns.countplot(x="CNT_CHILDREN", hue="TARGET", data=combined_df, palette="Set1")
plt.legend(loc="upper center")
plt.title("Distribution of Number of children client has by repayment status")
# Second subplot
plt.subplot(212)
sns.countplot(x="CNT_FAM_MEMBERS", hue="TARGET", data=combined_df, palette="Set1")
plt.legend(loc="upper center")
plt.title("Distribution of Number of family members client has by repayment status")
fig.set_facecolor("lightblue")
plt.show()
Distribution of contract type ,gender ,own car ,own house with respect to Repayment status(Target variable)
default = combined_df[combined_df["TARGET"]==1][[ 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']]
non_default = combined_df[combined_df["TARGET"]==0][[ 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']]
d_cols = ['NAME_CONTRACT_TYPE_x', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
d_length = len(d_cols)
fig = plt.figure(figsize=(16,4))
for i,j in itertools.zip_longest(d_cols,range(d_length)):
plt.subplot(1,4,j+1)
default[i].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism"),startangle = 90,
wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.ylabel("")
plt.title(i+"-Defaulter")
fig = plt.figure(figsize=(16,4))
for i,j in itertools.zip_longest(d_cols,range(d_length)):
plt.subplot(1,4,j+1)
non_default[i].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",3),startangle = 90,
wedgeprops={"linewidth":1,"edgecolor":"white"},shadow =True)
circ = plt.Circle((0,0),.7,color="white")
plt.gca().add_artist(circ)
plt.ylabel("")
plt.title(i+"-Repayer")
Distribution of amount data
cols = [ 'AMT_INCOME_TOTAL', 'AMT_CREDIT_x','AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x']
length = len(cols)
cs = ["r","b","g","k"]
ax = plt.figure(figsize=(18,18))
ax.set_facecolor("lightgrey")
for i,j,k in itertools.zip_longest(cols,range(length),cs):
plt.subplot(2,2,j+1)
sns.distplot(combined_df[combined_df[i].notnull()][i],color=k)
plt.axvline(combined_df[i].mean(),label = "mean",linestyle="dashed",color="k")
plt.legend(loc="best")
plt.title(i)
plt.subplots_adjust(hspace = .2)
Comparing summary statistics between defaulters and non - defaulters for loan amounts.
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
# Compare summary statistics between defaulters and non-defaulters for loan amounts
df = combined_df.groupby("TARGET")[cols].describe().transpose().reset_index()
df = df[df["level_1"].isin([ 'mean', 'std', 'min', 'max'])]
df_x = df[["level_0","level_1",0]]
df_y = df[["level_0","level_1",1]]
df_x = df_x.rename(columns={'level_0':"amount_type", 'level_1':"statistic", 0:"amount"})
df_x["type"] = "REPAYER"
df_y = df_y.rename(columns={'level_0':"amount_type", 'level_1':"statistic", 1:"amount"})
df_y["type"] = "DEFAULTER"
df_new = pd.concat([df_x, df_y], axis=0)
# Get unique statistics to plot
stat = df_new["statistic"].unique().tolist()
length = len(stat)
plt.figure(figsize=(20, 20))
# Create subplots for each statistic
for i, j in itertools.zip_longest(stat, range(length)):
plt.subplot(2, 2, j + 1)
# Use keyword arguments for sns.barplot
fig = sns.barplot(x="amount_type", y="amount", data=df_new[df_new["statistic"] == i],
hue="type", palette=["g", "r"])
plt.title(f"{i} -- Defaulters vs Non-defaulters")
plt.subplots_adjust(hspace=0.4)
fig.set_facecolor("lightgrey")
plt.show()
Income of client
1 . Average income of clients who default and who do not are almost same.
2 . Standard deviation in income of client who default is very high compared to who do not default.
3 . Clients who default also has maximum income earnings
Credit amount of the loan ,Loan annuity,Amount goods price - 1 . Statistics between credit amounts,Loan annuity and Amount goods price given to cilents who default and who dont are almost similar.
Average Income,credit,annuity & goods_price by gender
import seaborn as sns
import matplotlib.pyplot as plt
# Define the columns for the data
cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x']
# Calculate the mean for each gender group
df1 = combined_df.groupby("CODE_GENDER")[cols].mean().transpose().reset_index()
# Separate data for each gender
df_f = df1[["index", "F"]]
df_f = df_f.rename(columns={'index': "amt_type", 'F': "amount"})
df_f["gender"] = "FEMALE"
df_m = df1[["index", "M"]]
df_m = df_m.rename(columns={'index': "amt_type", 'M': "amount"})
df_m["gender"] = "MALE"
df_xna = df1[["index", "XNA"]]
df_xna = df_xna.rename(columns={'index': "amt_type", 'XNA': "amount"})
df_xna["gender"] = "XNA"
# Combine all gender data into a single DataFrame
df_gen = pd.concat([df_m, df_f, df_xna], axis=0)
# Create the barplot with correct keyword arguments
plt.figure(figsize=(12, 5))
ax = sns.barplot(x="amt_type", y="amount", data=df_gen, hue="gender", palette="Set1")
plt.title("Average Income, Credit, Annuity & Goods Price by Gender")
plt.show()
Scatter plot between credit amount and annuity amount
fig = plt.figure(figsize=(10,8))
plt.scatter(combined_df[combined_df["TARGET"]==0]['AMT_ANNUITY_x'],combined_df[combined_df["TARGET"]==0]['AMT_CREDIT_x'],s=35,
color="b",alpha=.5,label="REPAYER",linewidth=.5,edgecolor="k")
plt.scatter(combined_df[combined_df["TARGET"]==1]['AMT_ANNUITY_x'],combined_df[combined_df["TARGET"]==1]['AMT_CREDIT_x'],s=35,
color="r",alpha=.2,label="DEFAULTER",linewidth=.5,edgecolor="k")
plt.legend(loc="best",prop={"size":15})
plt.xlabel("AMT_ANNUITY")
plt.ylabel("AMT_CREDIT")
plt.title("Scatter plot between credit amount and annuity amount")
plt.show()
Pair Plot between amount variables
AMT_INCOME_TOTAL - Income of the client
AMT_CREDIT - Credit amount of the loan
AMT_ANNUITY - Loan annuity
AMT_GOODS_PRICE - For consumer loans it is the price of the goods for which the loan is given
amt = combined_df[[ 'AMT_INCOME_TOTAL','AMT_CREDIT_x',
'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',"TARGET"]]
amt = amt[(amt["AMT_GOODS_PRICE_x"].notnull()) & (amt["AMT_ANNUITY_x"].notnull())]
sns.pairplot(amt,hue="TARGET",palette=["b","r"])
plt.show()
Distribution of Suite type
NAME_TYPE_SUITE - Who was accompanying client when he was applying for the loan.
plt.figure(figsize=(18,12))
plt.subplot(121)
sns.countplot(y=combined_df["NAME_TYPE_SUITE_x"],
palette="Set2",
order=combined_df["NAME_TYPE_SUITE_x"].value_counts().index[:5])
plt.title("Distribution of Suite type")
plt.subplot(122)
sns.countplot(y=combined_df["NAME_TYPE_SUITE_x"],
hue=combined_df["CODE_GENDER"],palette="Set2",
order=combined_df["NAME_TYPE_SUITE_x"].value_counts().index[:5])
plt.ylabel("")
plt.title("Distribution of Suite type by gender")
plt.subplots_adjust(wspace = .4)
Distribution of client income type
NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave,…)
plt.figure(figsize=(18,12))
plt.subplot(121)
sns.countplot(y=combined_df["NAME_INCOME_TYPE"],
palette="Set2",
order=combined_df["NAME_INCOME_TYPE"].value_counts().index[:4])
plt.title("Distribution of client income type")
plt.subplot(122)
sns.countplot(y=combined_df["NAME_INCOME_TYPE"],
hue=combined_df["CODE_GENDER"],
palette="Set2",
order=combined_df["NAME_INCOME_TYPE"].value_counts().index[:4])
plt.ylabel("")
plt.title("Distribution of client income type by gender")
plt.subplots_adjust(wspace = .4)
Average Earnings by different professions and education types
import seaborn as sns
import matplotlib.pyplot as plt
# Group by education type and income type, then calculate the mean income
edu = combined_df.groupby(['NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE'])['AMT_INCOME_TOTAL'].mean().reset_index().sort_values(by='AMT_INCOME_TOTAL', ascending=False)
# Create the plot
fig = plt.figure(figsize=(13, 7))
ax = sns.barplot(x='NAME_INCOME_TYPE', y='AMT_INCOME_TOTAL', data=edu, hue='NAME_EDUCATION_TYPE', palette="seismic")
# Set the background color and title
ax.set_facecolor("k")
plt.title("Average Earnings by Different Professions and Education Types")
# Show the plot
plt.show()
Documents provided by the clients.
FLAG_DOCUMENT - Did client provide documents.(1,0)
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
# Replace target values with labels for better readability
df_flag["TARGET"] = df_flag["TARGET"].replace({1: "defaulter", 0: "repayer"})
# Set up the figure with subplots
fig = plt.figure(figsize=(13, 24))
fig.set_facecolor("lightgrey")
# Loop over the columns and create count plots for each document flag
for i, j in itertools.zip_longest(cols, range(len(cols))):
plt.subplot(5, 4, j + 1)
ax = sns.countplot(x=df_flag[i], hue=df_flag["TARGET"], palette=["r", "b"]) # Specify x
plt.yticks(fontsize=5)
plt.xlabel("")
plt.title(i)
ax.set_facecolor("k")
# Show the plot
plt.tight_layout()
plt.show()
Total and average amounts applied and credited in previous applications
AMT_APPLICATION-For how much credit did client ask on the previous application. >AMT_CREDIT-Final credit amount on the previous application. This differs from AMT_APPLICATION in a way that the AMT_APPLICATION is the amount for which the client.
import seaborn as sns
import matplotlib.pyplot as plt
# Group by contract type and calculate the mean and sum
mn = combined_df.groupby("NAME_CONTRACT_TYPE_y")[["AMT_APPLICATION","AMT_CREDIT_y"]].mean().stack().reset_index()
tt = combined_df.groupby("NAME_CONTRACT_TYPE_y")[["AMT_APPLICATION","AMT_CREDIT_y"]].sum().stack().reset_index()
# Set up the figure for plotting
fig = plt.figure(figsize=(10, 13))
fig.set_facecolor("ghostwhite")
# Plot average amounts by contract types
plt.subplot(211)
ax = sns.barplot(x="level_1", y=0, data=mn[:6], hue="NAME_CONTRACT_TYPE_y", palette="inferno")
ax.set_facecolor("k")
ax.set_xlabel("Average Amounts")
ax.set_title("Average Amounts by Contract Types")
# Plot total amounts by contract types
plt.subplot(212)
ax1 = sns.barplot(x="level_1", y=0, data=tt[:6], hue="NAME_CONTRACT_TYPE_y", palette="magma")
ax1.set_facecolor("k")
ax1.set_xlabel("Total Amounts")
ax1.set_title("Total Amounts by Contract Types")
# Adjust layout for better spacing
plt.subplots_adjust(hspace=0.2)
# Show the plot
plt.show()
Annuity of previous application
AMT_ANNUITY - Annuity of previous application
plt.figure(figsize=(14,5))
plt.subplot(121)
combined_df.groupby("NAME_CONTRACT_TYPE_y")["AMT_ANNUITY_y"].sum().plot(kind="bar")
plt.xticks(rotation=0)
plt.title("Total annuity amount by contract types in previous applications")
plt.subplot(122)
combined_df.groupby("NAME_CONTRACT_TYPE_y")["AMT_ANNUITY_y"].mean().plot(kind="bar")
plt.title("average annuity amount by contract types in previous applications")
plt.xticks(rotation=0)
plt.show()
Count of application status by application type.
NAME_CONTRACT_TYPE -Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application.
NAME_CONTRACT_STATUS -Contract status (approved, cancelled, ...) of previous application.
ax = pd.crosstab(combined_df["NAME_CONTRACT_TYPE_y"],combined_df["NAME_CONTRACT_STATUS"]).plot(kind="barh",figsize=(10,7),stacked=True)
plt.xticks(rotation =0)
plt.ylabel("count")
plt.title("Count of application status by application type")
ax.set_facecolor("k")
Point to infer from the graph:
Consumer loan applications are most approved loans and cash loans are most cancelled and refused loans.
Percentage of applications accepted,cancelled,refused and unused for different loan purposes.
NAME_CASH_LOAN_PURPOSE - Purpose of the cash loan.
NAME_CONTRACT_STATUS - Contract status (approved, cancelled, ...) of previous application.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import itertools
# Creating the percentage values for different contract statuses
purpose = pd.crosstab(combined_df["NAME_CASH_LOAN_PURPOSE"], combined_df["NAME_CONTRACT_STATUS"])
# Calculating the percentage for each contract status type
purpose["a"] = (purpose["Approved"] * 100) / (purpose["Approved"] + purpose["Canceled"] + purpose["Refused"] + purpose["Unused offer"])
purpose["c"] = (purpose["Canceled"] * 100) / (purpose["Approved"] + purpose["Canceled"] + purpose["Refused"] + purpose["Unused offer"])
purpose["r"] = (purpose["Refused"] * 100) / (purpose["Approved"] + purpose["Canceled"] + purpose["Refused"] + purpose["Unused offer"])
purpose["u"] = (purpose["Unused offer"] * 100) / (purpose["Approved"] + purpose["Canceled"] + purpose["Refused"] + purpose["Unused offer"])
# Selecting the relevant columns and stacking the data
purpose_new = purpose[["a", "c", "r", "u"]]
purpose_new = purpose_new.stack().reset_index()
purpose_new["NAME_CONTRACT_STATUS"] = purpose_new["NAME_CONTRACT_STATUS"].replace({
"a": "accepted_percentage",
"c": "cancelled_percentage",
"r": "refused_percentage",
"u": "unused_percentage"
})
# Preparing for plotting
lst = purpose_new["NAME_CONTRACT_STATUS"].unique().tolist()
length = len(lst)
cs = ["lime", "orange", "r", "b"]
# Plotting the data
fig = plt.figure(figsize=(14, 18))
fig.set_facecolor("lightgrey")
# Loop through the unique contract statuses and plot the data
for i, j, k in itertools.zip_longest(lst, range(length), cs):
plt.subplot(2, 2, j + 1)
dat = purpose_new[purpose_new["NAME_CONTRACT_STATUS"] == i]
ax = sns.barplot(x=0, y="NAME_CASH_LOAN_PURPOSE", data=dat.sort_values(by=0, ascending=False), color=k)
plt.ylabel("")
plt.xlabel("percentage")
plt.title(f"{i} by purpose")
plt.subplots_adjust(wspace=0.7)
ax.set_facecolor("k")
plt.show()
Purposes like XAP ,electronic eqipment ,everey day expences and education have maximum loan acceptance.
Loan puposes like Hobby, payment of other loans ,refusal to name goal ,buying new home or car have most refusals.
37.5% of XNA purpose loans are cancalle
Days variables - Relative to application date of current application
DAYS_FIRST_DRAWING - Relative to application date of current application when was the first disbursement of the previous application.
DAYS_FIRST_DUE - Relative to application date of current application when was the first due supposed to be of the previous application.
DAYS_LAST_DUE_1ST_VERSION - Relative to application date of current application when was the first due of the previous application.
DAYS_LAST_DUE -Relative to application date of current application when was the last due date of the previous application.
DAYS_TERMINATION - Relative to application date of current application when was the expected termination of the previous application.
cols = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE', 'DAYS_TERMINATION']
plt.figure(figsize=(12,6))
sns.heatmap(combined_df[cols].describe()[1:].transpose(),
annot=True,linewidth=2,linecolor="k",cmap=sns.color_palette("inferno"))
plt.show()
df_repayer = combined_df[combined_df['TARGET'] == 0]
df_defaulter = combined_df[combined_df['TARGET'] == 1]
# Select only numeric columns from the DataFrame
df_repayer_numeric = df_repayer.select_dtypes(include=[np.number])
# Compute the correlation matrix
corrmat = df_repayer_numeric.corr()
# Mask the upper triangle of the correlation matrix (to avoid redundant values)
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset the index to make it easier to work with
corrdf = corrdf.unstack().reset_index()
# Rename the columns for clarity
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop rows where the correlation is NaN
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values and take the absolute value
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort the correlations in descending order and show the top 10
corrdf.sort_values(by='Correlation', ascending=False).head(10)
| Var1 | Var2 | Correlation | |
|---|---|---|---|
| 4786 | AMT_GOODS_PRICE_y | AMT_APPLICATION | 1.00 |
| 2278 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 1.00 |
| 984 | FLAG_EMP_PHONE | DAYS_EMPLOYED | 1.00 |
| 4787 | AMT_GOODS_PRICE_y | AMT_CREDIT_y | 0.99 |
| 454 | AMT_GOODS_PRICE_x | AMT_CREDIT_x | 0.99 |
| 4711 | AMT_CREDIT_y | AMT_APPLICATION | 0.97 |
| 1519 | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.95 |
| 5547 | DAYS_TERMINATION | DAYS_LAST_DUE | 0.93 |
| 1352 | CNT_FAM_MEMBERS | CNT_CHILDREN | 0.88 |
| 1823 | LIVE_REGION_NOT_WORK_REGION | REG_REGION_NOT_WORK_REGION | 0.87 |
# Select only numeric columns from the df_defaulter DataFrame
df_defaulter_numeric = df_defaulter.select_dtypes(include=[np.number])
# Compute the correlation matrix for the numeric columns
corrmat = df_defaulter_numeric.corr()
# Mask the upper triangle of the correlation matrix (to avoid redundant values)
corrdf = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
# Unstack the correlation matrix and reset the index to make it easier to work with
corrdf = corrdf.unstack().reset_index()
# Rename the columns for clarity
corrdf.columns = ['Var1', 'Var2', 'Correlation']
# Drop rows where the correlation is NaN
corrdf.dropna(subset=['Correlation'], inplace=True)
# Round the correlation values and take the absolute value
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
# Sort the correlations in descending order and show the top 10
corrdf.sort_values(by='Correlation', ascending=False).head(10)
| Var1 | Var2 | Correlation | |
|---|---|---|---|
| 4786 | AMT_GOODS_PRICE_y | AMT_APPLICATION | 1.00 |
| 984 | FLAG_EMP_PHONE | DAYS_EMPLOYED | 1.00 |
| 2278 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 1.00 |
| 4787 | AMT_GOODS_PRICE_y | AMT_CREDIT_y | 0.99 |
| 454 | AMT_GOODS_PRICE_x | AMT_CREDIT_x | 0.98 |
| 4711 | AMT_CREDIT_y | AMT_APPLICATION | 0.97 |
| 1519 | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.96 |
| 5547 | DAYS_TERMINATION | DAYS_LAST_DUE | 0.94 |
| 5394 | DAYS_LAST_DUE_1ST_VERSION | DAYS_FIRST_DRAWING | 0.90 |
| 1352 | CNT_FAM_MEMBERS | CNT_CHILDREN | 0.89 |